Abstract

The fertility of a soil is governed by its pH (potential of hydrogen) value.
This paper presents a novel approach for predicting the pH value of a soil
from the RGB (Red, Green, Blue) values of an image. The study uses machine
learning techniques to develop a model that predicts soil pH from the colour
information captured in an image of the soil. The model was trained on a
dataset containing RGB values and the corresponding pH value as attributes
and tested using a variety of images. The results show that the proposed
model predicts soil pH with low error, demonstrating the potential of image
analysis as a practical and efficient method for soil pH determination in
agriculture and soil science. Several regression approaches were implemented
on the available dataset, and the experimental results show that polynomial
regression is the most effective method for analysing this dataset because
the data are not linear.

Keywords: Prediction; Machine Learning; Regression; Image processing.


1. INTRODUCTION 


India is one of the countries with abundant nutrient-rich land that can be
used for agriculture, and farming is a time-honoured profession that has
been practised since the beginning of recorded history. Agriculture is
gaining popularity not only in rural communities but also among urban
people, who are investing considerable time and showing interest in farming
because of the ever-growing demand for agricultural products to serve the
world’s ever-increasing population. Agriculture was one of the fundamental
forces that drove the industrial revolution and the economy of a specific
region. To ensure agricultural sustainability, it is essential to understand
the long-term consequences of the many different methods of soil management
and to take significant care of soil quality.


Similar to air and water, soil is a fundamental 
natural resource that offers a wide range of bene- 
fits to humans in the form of commodities and ser- 
vices provided by ecosystems. Soil can be defined as 
a loose surface substance composed of rock debris 
and organic elements. Leaching, weathering, and 
the activity of microbes all work together to produce 
the great diversity of soil types that exist today. Each 
form of soil has its own set of advantages and disad- 
vantages. 


Soil has been known for a long time, but only relatively recently have
people realized how critical it is to preserve and expand the range of
services provided by ecosystems. The pH value of the soil is one of the
major factors to be considered before any cultivation (Barman). Soil pH is
an important component




Shivakoti, Reddy K and Reddy 


in crop health and productivity, as well as in the general health of an
ecosystem. Soils can be tested to determine whether they are naturally
acidic or alkaline, and optimal plant growth depends on a proper pH balance.
The acidity or basicity of soil is indicated by its pH: a pH of 7 is
neutral, values below 7 indicate acidity, and values above 7 indicate
basicity. A soil with a pH of 5.5 to 7.0 is generally good for cultivation
(Barman et al.). Effective crop management and sustainable agriculture
require accurate and efficient technologies for measuring and predicting
soil pH.

In general, farmers take soil samples to a laboratory that specializes in
testing soil pH, or they consult soil pH colour charts. Sometimes a
specialist may assist the farmers in determining the pH value of the soil.
However, expert advice is not available in all situations, and each of these
approaches requires some amount of time, work, and specialized knowledge. A
soil pH chart is not an adequate method for determining the pH of soil,
since it relies on human perception and the expertise of a trained
professional. Calculating the pH value of soil in the laboratory requires a
soil pH meter in addition to a soil colour pH card, and measuring with a pH
meter can take longer than an hour even for a relatively simple soil sample.
Automation is becoming increasingly prevalent in day-to-day life as a result
of advances in technology and increased computer usage. It not only speeds
up the process but also makes the end product less prone to errors. Image
processing and regression are the two methods used in this work to determine
the pH of the soil (Barman et al.).


The use of soil images, specifically through image processing and machine
learning techniques, is one of the promising strategies for predicting soil
pH. Regression analysis is a statistical tool for determining the
relationship between one or more independent variables (image features) and
a dependent variable (soil pH). By examining features such as colour in
photographs of soil, the pH value of the soil can be predicted.

Images have significant advantages over traditional approaches for
predicting soil pH, such as soil sampling and chemical analysis. First,
imaging is a non-destructive procedure, which means that no soil is
disturbed in order to collect a measurement. Second, it is more efficient,
since photographs can be used to assess large regions of soil swiftly and
simply. Third, it can produce a more accurate estimate of soil pH, since it
considers many soil properties rather than depending on a single
measurement. In this research work, regression analysis is used to predict
the value of soil pH from photographs. Experiments were conducted using
various regression models, namely linear regression, random forest
regression, decision tree regression, MLP regression, and polynomial
regression, to estimate the soil pH value. Real-world data are used to
evaluate the performance of these models, and the results are presented in
this paper. The purpose of this study is to show the potential of using
pictures to estimate soil pH and to give a foundation for future research in
this area.


2. Literature Review 


The soil pH was predicted using the RGB colour space of photographs of the
soil in [2, 3, 4, 5, 6, 7, 8]. A digital camera is used to take pictures of
the soil, and Equation (1) is then used to predict the soil pH.


Feature of Soil (pH Index) = Red/Green/Blue (1) 


Abu et al. (Abu, Nasir, and Bala) developed an expert system that uses fuzzy
logic to regulate soil pH. During the procedure, the pH level of the soil
was adjusted so that farmers could replace the fertilizer and ensure high
plant quality. Babu and colleagues (Babu and Pandian) presented an approach
that characterizes the soil’s physical properties. They implemented the
fractal dimension technique using LabView.

To test the model, 24-bit colour photographs were used as input and
converted to 8-bit images, and features were then extracted using an
equation suggested by Kumar et al. (Kumar et al.). Aziz et al. (Aziz, Ahmed,
and Abraham) used the same red, green, and blue image values as Kumar et al.
(Kumar et al.) and fed them into a neural network for training and testing,
achieving an accuracy of 80% with ten neurons in the hidden layer.
Gurubasava and Mahantesh S.D. et al. (Gurubasava) determined the pH value of
the soil from the average of the RGB values of the photographs of the soil.
The average RGB values were compared with the actual pH of the soil, as well
as with their prediction of it. Barman et al. (Barman et al.) devised a
method for predicting soil pH that computes the hue, saturation, and value
of soil pictures using HSV colour image processing and then applies
regression techniques such as linear, logarithmic, exponential, and
quadratic regression.


Sagar et al. (Sagar et al.) proposed a method for predicting the pH of soil
in which photos are taken with a camera and a Raspberry Pi, and the pH of
the soil is estimated by applying an algorithm to the processed image. Aside
from soil pH, other soil variables, such as soil moisture (Sagar et al.;
Matei et al.; Pandey et al.; Taheri-Garavand, Meda, and Naderloo), the
Azotobacterial population in soil (Ebrahimi et al.), soil mapping (Barman
et al.), and soil organic matter (Mohan, Mridula, and Mohanan; Ayoubi et
al.), have been determined with the help of machine learning algorithms.
The authors of these studies analysed the relationship between various soil
parameters using ANN and regression. The relevant literature shows that
machine learning methods can be used to predict several soil parameters, one
of which is soil pH.


Anup Vibhute et al. (Vibhute and Koli) collected soil samples from different
soil fields, took three photos of each sample, and pre-processed the colour
photographs to reduce noise. After extracting various colour features, such
as RGB, Lab, HSV, and GLCM, they observed a correlation between the
calculated features and the pH values. They then identified the regression
analysis that produced the best results, with a low Mean Square Error (MSE)
and a high R² value.


3. Proposed System 


A. Dataset description 


The dataset was taken from Kaggle (ROBERT) in the form of a Comma Separated
Value (CSV) file, which contains pH values for various combinations of the
Red (R), Green (G), and Blue (B) values of an image. Because the values come
from an image, R, G, and B lie in the range 0-255. The dataset consists of
653 rows and 4 attributes, namely blue, green, red, and label (pH). Table 1
shows sample RGB values from the data set.


TABLE 1. Sample RGB values of the data set 


Red (R)   Green (G)   Blue (B)
231       21          36
250       84          36
255       164         37
255       205         22
221       223         38
148       214         29
76        181         0
0         156         13
0         166         92

FIGURE 1. Scatter plot of RGB values and the corresponding pH values


Figure 1 depicts a scatter plot of the Red, Green, and Blue values (range
0-255) against their corresponding pH values (0-14) in the dataset. It can
easily be inferred from the plot that the data points are not linear;
rather, they are widely spread across the length and breadth of the plot.

B. Methodology 

Linear Regression 
For this, we employed a technique known as linear regression analysis.
Linear regression, often known as the linear model, is a statistical
technique for predicting a numerical outcome variable (y). At least one
predictor variable (x) must be used to make this forecast. The goal is to
provide an expression for the value of y as a function of x. Once a
statistically sound model has been built, it may be used to forecast y for
new x values.
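As a minimal sketch of the idea (not the implementation used in this study), the following fits a least-squares line with a single predictor. The function names and the toy data are assumptions made for illustration only; they do not come from the dataset.

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x_new):
    """Apply a fitted (intercept, slope) pair to new predictor values."""
    intercept, slope = model
    return [intercept + slope * xi for xi in x_new]


# Hypothetical toy data (not from the paper): the points lie on y = 1 + 2x.
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [1.0, 3.0, 5.0, 7.0]

model = fit_simple_linear(x_train, y_train)
forecast = predict(model, [4.0, 5.0])
```

With three predictors (R, G, B), the same principle applies, but the coefficients are found by solving a small linear system rather than with the closed form above.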

Random Forest Regression 

Random Forest Regression is a Machine Learning technique that employs
numerous random decision trees, each of which has been trained on a subset
of the data. Using several trees gives the algorithm stability and decreases
variance. Because of its capacity to operate effectively with big and
diverse datasets, the random forest regression technique is a popular model.
Each tree is created by the algorithm using a distinct sample of the input
data. A different sample of features is chosen for splitting at each node,
and the trees operate in parallel with no interaction. The predictions from
each tree are then averaged to give a single outcome, which is the Random
Forest prediction.

Decision Tree Regression 

A Decision Tree Regressor is a type of supervised 
machine learning algorithm that is used for regres- 
sion tasks. It creates a model in the form of a tree 
structure, with internal nodes representing feature(s) 
on which the data is split and leaf nodes representing 
the output. The algorithm recursively splits the data 
into subsets based on the feature that results in the 
highest reduction in impurity (e.g., variance or mean 
squared error). The final output of the tree is the 
average target value of the training samples in the 
corresponding leaf node. Decision tree regressors 
are simple to understand and interpret and can han- 
dle both linear and non-linear relationships between 
features and target. 
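The splitting rule described above can be sketched with a one-level tree (a "stump"). This is an illustrative simplification, not the regressor used in the study: it searches every feature and threshold for the split that minimises the summed squared error of the two children, then predicts the mean target of the matching leaf. All names and the sample rows are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return the (feature index, threshold) minimising the children's total SSE."""
    best = None  # (total_sse, feature, threshold)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, f, t)
    return best[1], best[2]


def fit_stump(rows, targets):
    """Fit a one-split tree; each leaf stores the mean target of its samples."""
    f, t = best_split(rows, targets)
    left = [y for r, y in zip(rows, targets) if r[f] <= t]
    right = [y for r, y in zip(rows, targets) if r[f] > t]
    return {"feature": f, "threshold": t, "left": mean(left), "right": mean(right)}


def predict_stump(stump, row):
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


# Hypothetical (R, G, B) rows and pH labels, invented for illustration.
rows = [(250, 80, 30), (240, 90, 35), (20, 160, 90), (10, 150, 100)]
ph = [2.0, 3.0, 9.0, 10.0]
stump = fit_stump(rows, ph)
```

A full decision tree applies the same search recursively to each child until a stopping condition is met.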

Polynomial Regression 

Polynomial Regression is a form of regression analysis in which the
relationship between the independent variables and the dependent variable is
modelled as an nth-degree polynomial. It is a subcategory of Linear
Regression in which a curvilinear relationship between the dependent and
independent variables is assumed, and a polynomial equation is fit to the
data points. Polynomial regression is used when there is no linear
association between the variables.
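To illustrate why a polynomial fit helps when the relationship is curved, the sketch below compares a degree-1 and a degree-2 fit on a single predictor using NumPy. The data are synthetic and the degree is chosen for the example only; the study itself reports using degree 4 on the RGB attributes.

```python
import numpy as np

# Hypothetical single-channel data (invented for illustration): the target
# follows an exact quadratic in the predictor, so a straight line cannot fit it.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 0.5 * x**2 - 2.0 * x + 7.0

# Degree-1 fit (ordinary linear regression) versus degree-2 fit.
linear_coeffs = np.polyfit(x, y, deg=1)
quad_coeffs = np.polyfit(x, y, deg=2)


def rmse(coeffs):
    """Root mean squared error of a fitted polynomial on the sample."""
    residuals = y - np.polyval(coeffs, x)
    return float(np.sqrt(np.mean(residuals**2)))


linear_rmse = rmse(linear_coeffs)
quad_rmse = rmse(quad_coeffs)
```

Here the quadratic fit recovers the generating coefficients and its error is essentially zero, while the straight line leaves a systematic residual.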

MLP Regression 


A Multi-Layer Perceptron (MLP) Regressor is a 
type of artificial neural network that is used for 
supervised learning tasks, specifically for regression 
problems. It consists of multiple layers of artifi- 
cial neurons, with the input layer receiving the input 
data, and the output layer providing the predicted 
output. The layers in between, called hidden lay- 
ers, are used to learn complex representations of 
the input data. The MLP regressor is trained using 
a variant of the backpropagation algorithm, which 
adjusts the weights of the neurons in each layer in 
order to minimize the difference between the pre- 
dicted and actual output. 

Root Mean Squared Error 

The Root Mean Square Error (RMSE), also known as the root mean square
deviation, is a popular statistic for evaluating the accuracy of a forecast;
it is given in Equation (2). Using Euclidean distance, it shows how far the
predicted values deviate from the actual values.

To compute it, we calculate the residual (the difference between the
prediction and the truth) for each data point, square the norm of each
residual, take the mean of these squared values, and finally take the square
root of that mean. The RMSE should be as low as possible for the best
models.


$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \lVert y(i) - \hat{y}(i) \rVert^{2}}{N}} \tag{2}$$


where N is the number of data points, y(i) is the i-th measurement, and
$\hat{y}(i)$ is its corresponding prediction.
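The RMSE computation can be sketched in a few lines of Python. The function name and the sample pH values are illustrative assumptions, not values from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical pH measurements and predictions, invented for illustration.
actual_ph = [6.0, 7.0, 8.0, 5.0]
predicted_ph = [5.0, 7.0, 10.0, 5.0]

# Squared residuals are 1, 0, 4, 0, so the mean is 1.25.
error = rmse(actual_ph, predicted_ph)
```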

Coefficient of Determination (R²)

It represents the squared correlation between the observed values of the
outcome and the values predicted by the model, where the correlation
coefficient is given in Equation (3). The better the model, the higher the
R² value. The value lies between 0.0 and 1.0. A value of 1.0 indicates a
perfect fit and thus a highly reliable model for future forecasts, while a
value of 0.0 indicates that the model fails to capture the data at all.


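$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \tag{3}$$


where

n = total number of observations,
$\sum x$ = total of the first variable's values,
$\sum y$ = total of the second variable's values,
$\sum xy$ = sum of the products of the first and second values,
$\sum x^{2}$ = sum of the squares of the first values,
$\sum y^{2}$ = sum of the squares of the second values.

The coefficient of determination = (correlation coefficient)² = r² = R².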
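A short sketch of the squared-correlation calculation, written directly from the sums defined above. The function name and the sample values are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


# Hypothetical observed and predicted values, invented for illustration.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]  # perfectly correlated with the observations

r = pearson_r(observed, predicted)
r_squared = r**2
```

Note that the predictions in this example are twice the observed values yet still give a squared correlation of 1. Library functions such as scikit-learn's `r2_score` instead compute 1 − SS_res/SS_tot, which penalises such offsets, so the two definitions agree only in particular cases.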

First, the dataset was imported and divided into training and test sets in
the ratio 0.8:0.2 (80% training and 20% test). Little data pre-processing
was needed, as the dataset contained no null values or unnecessary
attributes. The obtained data are evaluated both qualitatively and
quantitatively, and the resulting values are also illustrated graphically.
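The 80/20 split can be sketched as follows. The paper does not state which tool performed the split, so this is a plain-Python illustration under that assumption; the helper name, the seed, and the placeholder rows are hypothetical.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows of (red, green, blue, pH) standing in for the 653-row CSV;
# the values are synthetic and only the row count matches the dataset.
rows = [(i % 256, (2 * i) % 256, (3 * i) % 256, 7.0) for i in range(653)]

train, test = train_test_split_rows(rows)
```

With 653 rows this yields 522 training rows and 131 test rows. Fixing the seed makes the split reproducible between runs.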

Secondly, predictions were made using multiple regression models, namely the
Linear Regressor, Polynomial Regressor, Random Forest Regressor, Decision
Tree Regressor, and MLP Regressor, and the results are represented
graphically.

Various parameters were customized during training. The maximum depth of the
Random Forest Regressor was set to 2, and the degree of the polynomial
regression was set to 4. For the MLP Regressor, the maximum number of
iterations was set to 50, early stopping was enabled, and the solver was
‘Limited-memory Broyden-Fletcher-Goldfarb-Shanno’ (L-BFGS), an optimization
algorithm commonly used for non-linear optimization problems. For Linear
Regression and the Decision Tree Regressor, the default parameters were used
without customization. The obtained RMSE and R² score values for the various
regression models are shown in Table 2.
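The settings above could be expressed with scikit-learn roughly as follows. The paper does not name its library, so the use of scikit-learn, the pipeline construction for polynomial regression, and the dictionary layout are all assumptions; only the parameter values come from the text.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

models = {
    # Default parameters, as stated in the text.
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(),
    # Maximum tree depth limited to 2.
    "Random Forest Regression": RandomForestRegressor(max_depth=2),
    # Degree-4 polynomial features followed by a linear fit (assumed construction).
    "Polynomial Regression": make_pipeline(
        PolynomialFeatures(degree=4), LinearRegression()
    ),
    # L-BFGS solver, 50 iterations, early stopping requested. In scikit-learn,
    # early_stopping is documented as effective only for the 'sgd' and 'adam'
    # solvers, so it may have no effect in combination with 'lbfgs'.
    "MLP Regression": MLPRegressor(solver="lbfgs", max_iter=50, early_stopping=True),
}
```

Each entry exposes the same `fit` and `predict` methods, so the five models can be trained and scored in a single loop over the dictionary.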

From Table 2, the highest R² score was achieved by the Polynomial Regressor
(95.60%), followed by the Decision Tree Regressor (94.84%) and Random Forest
Regression (92.77%); these are the three best-performing models. MLP
Regression and Linear Regression achieved R² scores of 89.24% and 75.26%,
respectively. Figure 2 and Figure 3 show the RMSE and R² score of the
selected regression models.

TABLE 2. Obtained RMSE and R² score values for various regression models

Model                      RMSE   R² Score
Linear Regression          2.21   0.75
Random Forest Regression   1.19   0.92
Polynomial Regression      0.93   0.96
Decision Tree Regression   1.01   0.94
MLP Regression             1.46   0.89

FIGURE 2. RMSE of proposed models


As Figure 3 shows, Decision Tree Regression and Random Forest Regression had
an edge over Linear Regression. This is because the data are not linear, as
can easily be inferred from Figure 1: with the data widely scattered, these
algorithms outperformed linear regression. Table 3 compares the proposed
models with existing models.


FIGURE 3. R² score of proposed models

FIGURE 4. Polynomial regression graph between pH and attribute ‘Red’

FIGURE 5. Polynomial regression graph between target and attribute ‘Green’

FIGURE 6. Polynomial regression graph between pH and attribute ‘Blue’


TABLE 3. Comparison of proposed models R² value with the existing models

S.No.  Title                                                      R²
1.     Soil pH Determination Using Mobile Phone Captured Image    0.6
2.     Predication of soil pH using HSI colour image processing   0.86
       and regression over Guwahati, Assam, India
3.     Prediction of Soil pH using Smartphone based Digital       0.94
       Image Processing and Prediction Algorithm
4.     Determine the pH of Soil by using Neural Network Based     0.80
       on Soil’s Colour
5.     Proposed Model                                             0.96


4. CONCLUSION


The use of soil images, specifically through image processing and machine
learning techniques, is one of the promising strategies for predicting the
pH value of soil. Regression analysis is used in this paper to determine the
relationship between one or more independent variables (image features) and
a dependent variable (soil pH). Based on the findings, the polynomial
regression model is the most effective. Linear regression is poorly suited
to these data because the relationship between the RGB values and pH is not
linear, and applying it to this dataset resulted in a model that did not
perform well. Polynomial regression, in contrast, produced encouraging
results, with the lowest root mean squared error of all the models (0.93).
As a result, polynomial regression, followed by the decision tree and random
forest regressors, is effective at predicting the pH value of soil. Figures
4, 5 and 6 show the polynomial regression graphs (for the R, G and B
attributes of the dataset with respect to pH), which depict how well the
polynomial regression model fits the data.

The novelty of this research work lies in using the R, G, and B values
separately for training and using an image during testing, where the R, G,
and B values of the image are extracted and passed to the model for pH
prediction, whereas traditional pH prediction methods both train and test
the model on an image dataset. The proposed method has the potential to
reduce the cost and time required by traditional soil testing methods. The
study proposes a non-invasive method for predicting soil pH from RGB values,
which can be easily measured by a camera, and the method is efficient and
provides accurate results. An application incorporating this model can run
on mobile devices, which would make it easier for users, especially farmers
with minimal technical knowledge, since it only requires capturing an image
of the soil in the app and submitting it for pH prediction. One limitation
is the sample size used in the study, which is relatively small and may
limit the generalizability of the findings. Future work will focus on
improving the pH predictions by applying deep learning techniques.
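The test-time step of extracting R, G, and B values from an image might look like the sketch below. The paper does not specify how the values are aggregated over the image, so averaging each channel over all pixels is an assumption here, as are the function name and the tiny sample image.

```python
def mean_rgb(pixels):
    """Average each channel over an image given as rows of (R, G, B) tuples."""
    flat = [px for row in pixels for px in row]
    if not flat:
        raise ValueError("image has no pixels")
    n = len(flat)
    return tuple(sum(px[c] for px in flat) / n for c in range(3))


# Hypothetical 2x2 image, invented for illustration.
image = [
    [(200, 100, 50), (220, 120, 30)],
    [(180, 80, 70), (200, 100, 50)],
]

# One (R, G, B) feature row that a trained regressor could take as input.
features = mean_rgb(image)
```

The resulting triple has the same form as a row of the training CSV without its label, so it can be passed to whichever fitted model is used for prediction.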